Automated Huntington’s Disease Prognosis via Biomedical Signals and Shallow Machine Learning

Huntington’s disease (HD) is a rare, genetically-determined brain disorder that limits the life of the patient, although early prognosis of HD can substantially improve the patient’s quality of life. Current HD prognosis methods include using a variety of complex biomarkers such as clinical and imaging factors, however these methods have many shortfalls, such as their resource demand and failure to distinguish symptomatic and asymptomatic patients. Quantitative biomedical signaling has been used for diagnosis of other neurological disorders such as schizophrenia, and has potential for exposing abnormalities in HD patients. In this project, we used a premade, certified dataset collected at a clinic with 27 HD positive patients, 36 controls, and 6 unknowns with electroencephalography, electrocardiography, and functional near-infrared spectroscopy data. We first preprocessed the data and extracted a variety of features from both the transformed and raw signals, after which we applied a plethora of shallow machine learning techniques. We found the highest accuracy was achieved by a scaled-out Extremely Randomized Trees algorithm, with area under the curve of the receiver operator characteristic of 0.963 and accuracy of 91.353%. The subsequent feature analysis showed that 60.865% of the features had p<0.05, with the features from the raw signal being most significant. The results indicate the promise of neural and cardiac signals for marking abnormalities in HD, as well as evaluating the progression of the disease in patients.


I. INTRODUCTION
Huntington's disease (HD) is a neurodegenerative disease characterized by motor, cognitive, and/or behavioural disturbances, which manifest themselves in symptoms at a mean onset of 30-50 years of age [1]. HD is autosomal dominant, and the genetic cause is repeats in cytosine-adenine-guanine (CAG) repeats in the Huntingtin (HTT) gene, with a clear inverse relationship between the number of CAG trinucleotide repeats and the age of HD onset [1]. The CAG repeats causes mutational expansion of polyglutamine in HTT proteins, resulting in protein folding restriction, with prevalence of the mutation varying widely between geographical regions with an average worldwide prevalence of 2.7 HD positive per 100,000 [1], [2]. The disease is most prevalent in Western populations, however, at a rate of 10.6-13.7 HD-positive individuals per 100,000 people-with almost 30,000 HD patients in the United States [3].
HD currently has no cure, although new clinical approaches such as Antisense oligonucleotide therapy (AOT) have shown some promise for reducing levels of the mutation [3]. Early detection of HD, however, is crucial for clinical trials and intervention in the earliest stages of the disease, which can benefit the quality of life of the patients [4]. Current diagnosis techniques rely on a combination of behavioural, clinical, imaging, and genetic factors as well as a complex analysis into the HTT gene in the family. Such approaches have several drawbacks, including being time and resourcecostly, as well as failing to distinguish between asymptomatic and symptomatic carriers of the HTT mutation [5]. Many symptoms that mimic HD also exist, presenting a prognostic challenge in certain patients, which can be exacerbated in poorer or rural hospitals [5]. The relative rarity of the disease has also contributed to a shortfall in understanding of HD at a very specific, symptomatic level, signaling the need for more robust diagnostic solutions.
Several past studies have examined HD diagnosis in detail. Paulsen et al. (2008) studied hundreds of HD genepositive patients in an attempt to predict Huntington's progression decades in advance [6]. Each patient underwent several lengthy trials measuring genetic, neurobiological, clinical, and behavioural factors through blood tests, neurological and cognitive exams, psychological questionnaires, brain imaging, and more. The study found that the time of HD diagnosis could best be predicted by clinical and neu-roimaging markers (p<0.0001). Paulsen (2009) examined past literature on MRI and fMRI for Huntington's disease prognosis [7]. While there are limitations, the study describes the potential for magnetic resonance imaging for detective sensitive changes resulting from the earliest progressions of HD. Lastly, Mason et al. (2018) also focused on neuroimaging markers for early diagnosis of HD, with the objective of correcting for the imprecisity of statistical models based on motor symptoms [8]. Through resting-state and structural fMRI, the researchers trained a support-vector machine on a cohort of 19 HD-positive patients and 21 controls. The researchers found that a holistic imaging approaches resulted in high classification accuracy (p<0.03).
There is also literature on the altered electrical activity throughout the body in Huntington's patients, particularly in electroencephalography (EEG) and electrocardiography (ECG), both tests that measure electrical activity in various regions through sensitive electrodes. Scott et al. (1972) measured EEG levels at all voltages and frequencies in the brain in HD-positive and control patients [9]. They found that the presence of low voltage EEG, particularly in the delta and theta frequency bands, specified cortical atrophy often found in HD, while low voltage EEG was rarely present in control patients. Cankar et al. (2018) tested ECG in HD-positive and healthy control patients, since one of the most common causes of death in HD patients is heart disease [10]. The researchers found that HD-positive individuals exhibit higher voltage variability in the ECG regions among standard QT intervals (p<0.01), signifying cardio-electric remodeling in HD patients.
Biological signaling can be carried out quickly at low cost, and quantitative EEG (qEEG) has already shown promise for early detection of other neurodegenerative diseases such as Alzheimer's disease and Parkinson's disease [11], [12]. In this study, one of the first of its kind, we will test classical machine learning trained on EEG, ECG, and functional near-infrared spectrography (fNIRS). Then, the model will be optimized and evaluated, with the goal of assessing the potential for neural and cardiac signaling for HD prognosis.

A. DATASET OVERVIEW
Data for this study came from an HD and controls dataset published publicly by Lancaster University [13]. In the data collection process, 69 participants were recruited from the Neurological Clinic in Ljubljana, Slovenia to undergo several tests. As seen in Table 1, of these participants, 15 were symptomatic HD (SHD), 12 were pre-symptomatic HD (PHD), 36 were health controls and 6 were unclassified. Preliminary testing showed that the unclassified patients had extremely similar features to the control patients, and were thus considered controls for the remainder of this study. Over the course of 30 minutes (then cut to 20 minutes), EEG, ECG, fNIRS, respiration, and blood flow data was recorded via a variety of instruments, and the room and patient forehead temperature was also taken during this time. The EEG signal was recorded through V-Amp headset with 16 electrodes (10-20 system) at 1 kHz, with the participant sitting in a chair with eyes open. The specific electrodes used were the C3, C4, Cz, F3, F4, Fp1, Fp2, O1, O2, P3, P4, P7, P8, Pz, T7, T8 placements. The ECG signal was recorded over a single channel at 1.2 kHz with sensors placed on the shoulders and lower left rib. The fNIRS data was taken with 11 optodes (10-20 system) at 31.25Hz in placements N1, N2, N3, N4, N5, N6, N7, N8, N9, N10, and N11, each with deoxygenated and oxygenated channels (giving 22 channels in total). The data collection protocols in the dataset used in this study were approved by the commission of the Republic of Slovenia for Medical Ethics.

B. EEG, ECG, AND FNIRS SIGNAL PROCESSING
To be used for HD classification, the EEG, ECG, and fNIRS signals had to be preprocessed and features extracted, performed mostly using MNE (https://mne.tools/) [14]. As seen in Figure 1, signal preprocessing was divided into formatting/cutting, filtering, segmentation, and finally feature extraction. First, the EEG, ECG, and fNIRS data were separated from the rest of the data and concatenated into their own respective CSV files from the original MAT files.
Next, the data filtered and segmented. First, the data had to be converted into an MNE object, which was done in accordance with the 10-20 standard system (which the electrodes are placed in) for the EEG and fNIRS data. The EEG signal was filtered with EEG reference at a low pass of 0.5 Hz and a high pass of 45 Hz, in order to place an emphasis on the lower frequencies around the delta, theta, and alpha frequency bands, in which abnormalities can be indicative of HD pathology [15]. The ECG data was filtered with low pass of 0.05 Hz (the lowest frequency recorded by the machine), and a high pass of 100 Hz to eliminate high-frequency noise [16]. The fNIRS signal was filtered at a low pass of 0.2 Hz and high pass of 1.5 Hz to remove artifacts caused by blood pressure fluctuations [17]. The filtered signals for EEG, ECG, and fNIRS were then segmented into 5s epochs with 1s overlap, with the 1200s data for each patient being converted into 300 epochs-with 4s of unique data in each. For each patient, the bad epochs were dropped and the data was normalized by subtracting the mean of each channel. The data arrays for each signal type for all patients were saved in NPY format. An example of the EEG signal for each filtered band (same frequency bounds as shown for EEG in Table 2) for a healthy patient and HD patient are shown in Figure 2. As can be seen, the signal for the HD patient has generally lower amplitudes (in microvolts). In similar fashion, a filtered ECG signal example for healthy and HD patient can be seen in Figure 2, where the ECG bands are more consistent in amplitude and more periodic for the healthy control than in the HD patient. Figure 2 shown a similar comparison for the filtered fNIRS signal, where both bands more closely follow each other for the healthy patient.
Finally, the features could be extracted from the processed signals. As shown  in Table 2, standard statistical and slope values were first extracted from the signals of each patient, namely kurtosis, coefficient of variation, skewness, 1st and 2nd difference of mean, 1st and 2nd difference of max, slope mean, slope variance, and Higuchi fractal dimension (HFD). They were extracted on a per-channel basis, thus amounting to 160 features from EEG, 10 features from ECG, and 220 features from fNIRS. HFD can give information about the complexity of neuronal activity in various neurophysiological conditions, and is simple to calculate [18]. Next, Hijorth parameters were extracted from each channel: activity, mobility, and complexity, totaling to 48 features per epoch for EEG, 3 features for ECG, and 66 features for fNIRS. Activity represents the variance or power of the signal over a certain epoch/segment in time domain, and indicative of the power spectrum surface over the frequency domain. Activity can be calculated for signals simply through: Mobility indicates the mean frequency over an epoch in the time domain, also represented as the proportion of standard deviation (SD) over the power spectrum, calculated through the following equation: Lastly, complexity estimates the bandwidth of the signal over an epoch, essentially the average power of the second derivative of the signal, represented in: Several wavelet features were also extracted per channel, calculated using PyWavelets (https://pywavelets.readthedocs.io/) [19]. To calculate each feature, a discrete wavelet transform (DWT) was performed and the coefficients from the wavelet function stored. For EEG and fNIRS, the Coiflet (coif1) wavelet was used, and for ECG, the Daubechies (db4) wavelet was used for features. Then, the approximate and detailed mean, SD, energy, and entropy were all calculated from these coefficients over a time segment, amounting to 128 EEG features, 8 ECG features, and 176 fNIRS features per epoch. Energy (4) is the summation of the squared signal,   (5) is the measure of regularity/fluctuation in a time segment, shown in the following equations: The power-spectral density (PSD) values were calculated for a variety of bands based on frequency bounds shown in Table 2. The total amount of features amounted to 80 for EEG, 5 for ECG, and 44 for fNIRS. The ECG PSD bands were chosen based on the peak powers of typical ECG signals, and the fNIRS PSD bands were chosen based on respiration and heartbeat frequencies [20], [21]. PSD indicates the power levels over the frequency domain in each component in a signal segment, which tells the model the range of power over various signal frequencies, making it an effective predictor for abnormalities [22]. Welch's method of calculating PSD is an alternative approximation method to the Fast-Fourier Transform (FFT), which averages periodograms over time on a per-segment basis, represented through the equation [23]: An example of PSD over the frequency domain for each signal type is shown in Figure 3. The final data array amounted to 20631 epochs from 69 patients with 948 features per epoch. A group array was also made to ensure that the epochs of each patient were trained or tested together, preventing overfitting.

C. MODEL SELECTION AND CONSTRUCTION
Classical machine learning algorithms followed by deep neural networks were used for classification tasks based on the extracted EEG features. All classical machine learning algorithms were implemented using Scikit-learn (https://scikitlearn.org/), including Random Forest, Extremely Randomized Trees, Logistic Regression, Support-vector machines, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA) [24]. All models were tested using 10-fold cross validation with the group array to minimize overfitting and underfitting, and scored with accuracy, precision, recall, receiver operating characteristic (ROC-AUC), and F1 score. The best models were then found and hyperparameter-tuned through random grid search.
Random Forest is an ensemble method that uses many randomized decision trees, each of which are trained/fit on sub-samples of the dataset and subsets of the features at each split point, with their results averaged. This robust approach of different features at each tree split point has been shown to improve accuracy, reduce variability and reduce overfitting. Random Forests have been popular in qEEG analysis, such as in differentiating schizophrenia patients from controls [25].
Extremely Randomized trees (ERTs) are an ensemble method in of themselves, and is similar to Random Forest in that ERTs take subsets of the features to train each subtree, however contrast in that ERTs train each sub-tree on the entire dataset instead of a sub-sample. ERTs can also select either a random split or the best split for the optimal split, unlike the greedy algorithm of Random Forests.
Logistic Regression is effectively a linear regressor that models the data using a sigmoid function instead of a linear function. The algorithm itself remains similar to a linear regressor until the decision threshold, where in binomial logistic regression, the probabilistic result is reduced into a binary output (0 or 1). Support-Vector machines (SVM) use hyper planes to effectively partition data into classes, and are thus better suited for classification rather than regression. Based on the number of features, the algorithm finds, in an N-dimensional plane, an equation of N-1 dimensions that partitions the data into various classes, in this case two classes (0 or 1), through minimizing the distance between the equation graph and the individual data points (functional margin). Preliminary tests showed that the polynomial kernel outperformed other kernels, such as sigmoid and radial-basis-function, and thus was used for the remainder of the study.
Linear Discriminant Analysis uses a similar method to SVM, attempting to draw planes to divide sets of classes in the hyperplane of features, however uses different criteria. LDA attempts to maximize the mean distance between the points and the hyperplane, and minimize the variance between each class. Thus LDA performs similarly to SVM. QDA uses the same criteria, but with a degree-2 discriminant plane rather than linear.

A. PRELIMINARY MODEL RESULTS
The models outlined were first run with default hyperparameters testing a variety of metrics with k=10 on a personal computer with Intel Core i7-3930k CPU and NVIDIA GeForce RTX-2060 GPU. The concatenated feature vectors were created along with the label and group arrays, and fed into cross validation. The results of the models are shown in Table 3. As can be seen, Extremely Randomized Trees (ERTs) performed best with Random Forest close behind on all metrics except recall and f1 score, where the Random Forest outperformed and tied the ERTs, respectively. In the next section, the estimators of the Random Forest and ERT models were scaled out.

B. OPTIMIZED RESULTS
Here, the ERT and Random Forest algorithms were scaled out to better accommodate for the large feature count and sample size. The estimator counts were increased to n=1000 estimators in an experiment to evaluate the performance benefit of large estimator counts with many features. The two tuned models were them run on the same cross validation evaluator with k=10 and the full dataset, the results for which   Table 4. Once again, the Extremely Randomized Trees algorithm performed the best, again on all metrics except for recall. The results for the tuned models were visualized in a variety of ways. First confusion matrices showing the false-positives, true-positives, false-negatives, and true-negatives were created from the prediction arrays of the ERT and Random Forest evaluation, shown in Figure 4. Next, area-under-the-curve (AUC) plots were created from the false-positives and true-positive rates to visualize the receiver operating characteristic, as well as from the precision and recall for the precision-recall curve, shown in Figure 5.

C. STATISTICAL ANALYSIS
Statistical significance was also calculated via a variety of tests for the features using Statsmodels (https://www.statsmodels.org/) [26]. For the entire feature array, the R 2 coefficient of determination-representing the proportion of variation of the dependent variable explainable by the independent variable-was calculated to be 0.89. The F-statistic was calculated to be 275.9 and the probability-Fstatistic was 0, representing the probability that the observed results occurred by chance, ie. approximately 0%. The loglikelihood, or an estimate for the goodness-of-fit for the data and labels, came out to be 8412. More specific values, such as p and t values, were calculated on a per-feature basis with the results summarized in Table 5.
Feature importance was also calculated in order to reveal information into the most important values for predictions. Since performance was not particularly important, the feature importance was calculated through a default-hyperparameterized ERT. The data was split up on the basis of various attributes, such as the general group of feature, the type of signal, the specific channels, and bands. The results are shown for each in Figure 6.

IV. DISCUSSION
Overall, the procedure seem to have been chosen appropriately given the observations. For example, the EEG signal visualization in Figure 3.a, all of the power of the signal seems to be contained in Hz<50, with severe fluctuations at Hz=50. In the ECG signal, almost all the power is contained within Hz< 60, and most of the power of fNIRS is contained before 4 Hz-however most of the fNIRS activity occurs within Hz<2.0. These observations did not change substantially with different patient examples, indicating that the filtering bounds chosen for the three signal types were appropriate. Additionally, the choice of k-folds of 10 for 10fold cross validation seems to have resulted in a model that is neither extremely underfitted nor extremely overfitted, as is usually the case with such a value of k.
In this study, the relationship between EEG, ECG, and fNIRS and the state of a patient as HD-afflicted or healthy was statistically significant and consistent. In terms of statistical significance, as shown in Table 5, 577 out of the 948 total features had p<0.05, signifying statistical significance, with 357 of them having negligible p values. For these features, the null hypothesis can be rejected, and the null hypothesis can also safely be rejected when questioning the significance of the feature choice as a whole. T values-the difference of mean between groups, used to assess covariance-was high for the most statistically significant features.
The feature importance observations reveal some insight into the values most indicative of HD. Specifically, while the EEG signal was the most significant signal type by a small margin, the error bars have too large an overlap to conclude the feature importance of fNIRS in comparison to EEG-however ECG can be concluded to be the least significant signal type. In terms of feature type, the statistical and wavelet features were the most significant, which could call for more experimentation into various wavelet transform functions. Another interesting observation is that the P8 channel type was the most significant while the P7 channel had one of the lowest importances, despite both electrodes being symmetrical on either side of the head-which could indicate an asymmetry of HD. The feature importances in the EEG PSD frequency bands were mostly as expected, except for the apparent unimportance of the theta band, which could indicate some amount of noise in the theta band frequencies.
The R 2 coefficient of determination showed that 89% of the variation of the dependent variable (HD diagnosis) was predictable by the independent variable(s), which follows closely with the accuracy of the two best models, Extremely Randomized Trees and Random Forest. The probability fstatistic also shows a negligible probability of lucky accuracy. Overall, the feature choice reflects a large amount of information about the patient's pathology with HD.
Holistically, the models demonstrate the potential for biological signals such as EEG, ECG, and fNIRS for marking the path of Huntington's disease. The consistent superiority of the Extremely Randomized Trees algorithm was surprising considering the lack of its use in previous literature, such as in Schizophrenia [27]. The LDA also performed better than expected-achieving a high ROC AUC of 0.875-in comparison to its more complex QDA, suggesting more simple clustering divisions achievable by linear separators. The results of this study provide for a significant performance improvement over Odish et al. (2018), which used power spectral density of EEG data to classify Huntington's disease, achieving 83% accuracy and 0.9 ROC AUC [28].
There are several limitations to this study. First, despite the analysis into feature importance, it is still difficult to predict which features will scale well into larger or more comprehensive datasets with more patients and electrode channels. Second, the 6 unclassified patients still prompt some investigation, although removing them from the dataset altogether did not notably affect performance. Third, this study did not evaluate the various markers tested at differing stages of the disease, which would have likely been useful in assessing the efficacy of HD patient monitoring through biological signals. These are all topics for future investigation.

V. CONCLUSION
In this study we consolidated, preprocessed, and extracted features from a variety of biological signals from a complete dataset. We then formatted, filtered, segmented, and extracted features from the data, pertaining to hijorth, statistical, slope, wavelet, and power spectral features which had mostly statistically-significant correlations to HD positivity. We then ran a variety of machine learning algorithms and found high accuracy and ROC AUC with tree ensemble algorithms. We found the most significant features to be from EEG and fNIRS data, with statistical and wavelet features, with ECG being less useful. While there is still much research to be done in this area, this study provides some insight into the usefulness of neural and cardiac signaling for Huntington's disease prognosis, and could potentially be built upon with deeper or more powerful models, larger or more expansive datasets, or more thorough feature selection. Topics for future research could include evaluating biological signal features at various stages pre and post-Huntington's disease onset, as well as exploration of other neural signaling types such as MEG.